There are 10 questions in this activity. Save your answers in a Word document that you will hand in on Moodle as a .pdf. Keep your script file in your GitHub folder and make sure that all changes are pushed to GitHub. Code for all questions should be clearly commented with the question number. You will include a link to your script file as part of the final question in this activity. The assignment is worth 30 points.
I will provide a lot of code examples as a part of this Activity. As the course progresses, I will gradually provide less code. You will have to adapt this code in the questions. To do this successfully, you will likely need to refer to R documentation and online resources to thoroughly understand all parts of the code. Troubleshooting and learning to interpret code is a huge part of learning R! It often involves making a lot of mistakes and getting error messages. That’s great and normal! I’ve been working in R for 12 years, and I still generate a lot of mistakes and error messages! You’ll just get better at fixing them.
Common types of objects in R include vectors, matrices, and dataframes. Last class, we briefly learned about creating vectors in R. You assign a name to an object using the arrow: <- and can make a vector using c(). If you remember, R is vectorized so we can automatically apply vector or matrix math to an object. Running the name of an object will print out the object.
#make a vector of tree heights in meters
heights <- c(30,41,20,22)
#convert to cm
heights_cm <- heights*100
heights_cm
## [1] 3000 4100 2000 2200
R uses a very specific notation around vectors that is helpful for subsetting. Anytime you see square brackets [ ], recognize that this is subsetting an object in R. Also note that R always counts starting at one (not zero, as in many other programming languages!). Whenever you see a colon :, know that you are indexing through a range of values.
#look at the first tree height
heights[1]
## [1] 30
#look at the 2nd and 3rd tree heights
heights[2:3]
## [1] 41 20
I like to think of matrices as essentially multiple columns of vectors in R. You can set up a matrix using the matrix() function. Functions are built into R and will always have parentheses () following the function name. Anything inside the parentheses is called an argument. Here, to build a matrix, you supply a vector of numbers that fills in the matrix and specify the number of columns or rows. You can also specify how to fill in the matrix (do you go through all columns in each row?). If you ever aren't sure of the arguments for a function, you can use help() or ?.
#get more info on the matrix function
help(matrix)
## starting httpd help server ... done
Now, make some example matrices. Note the differences in filling in by column vs row.
#set up a matrix with 2 columns and fill in by rows
#first argument is the vector of numbers to fill in the matrix
Mat<-matrix(c(1,2,3,4,5,6), ncol=2, byrow=TRUE)
Mat
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
## [3,] 5 6
#set up a matrix that fills in by columns
#first argument is the vector of numbers to fill in the matrix
Mat.bycol<-matrix(c(1,2,3,4,5,6), ncol=2, byrow=FALSE)
Mat.bycol
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
Notice that when both matrices are printed, the output now has square bracket labels with two spots. This notation is always formatted [row,column]. You can use this notation with the name of the matrix to refer to specific spots in it.
#subset the matrix to look at row 1, column2
Mat.bycol[1,2]
## [1] 4
An empty spot before or after the comma inside square brackets means refer to all items in that slot. You'll notice the output is a vector when subsetting.
#look at all values in row 1
Mat.bycol[1,]
## [1] 1 4
#look at all values in column 2
Mat.bycol[,2]
## [1] 4 5 6
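Since R is vectorized, arithmetic applies element-wise to a whole matrix just as it does to a vector. Here is a quick sketch reusing the example matrix from above:

```r
#element-wise arithmetic applies to every entry of a matrix
Mat.bycol <- matrix(c(1,2,3,4,5,6), ncol=2, byrow=FALSE)
#multiply every entry by 10
Mat.bycol * 10
#add the matrix to itself, entry by entry
Mat.bycol + Mat.bycol
```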
Today, we will look at weather patterns and extreme events from five sites across the US. I downloaded this data from NOAA, and I’ve included instructions for downloading this type of data for future reference.
Most data that you will work with is too large to be typed into R scripts. A lot of data is available in spreadsheet or text file format, and you will want to work with this type of data as a dataframe. Dataframes are like matrices where columns have names and each row contains observations for the same entity. This works a lot like a spreadsheet, but we can apply our vector and matrix utilities in R. For example, we will read in a weather data file where columns contain temperature and precipitation data, and each row holds the observations for one day at one weather station. You may be used to working with data in Excel spreadsheets with a .xlsx extension, which is proprietary to Microsoft. This is not ideal to work with, so we will use .csv (comma separated values) files instead. A .csv file is the text version of a spreadsheet, where the values for each cell are written out and a comma marks the start of a new cell. You can always resave an .xlsx file as a .csv. Let's read in a file.
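Before reading in the real file, here is a minimal sketch of what a dataframe looks like, built by hand with made-up tree observations: named columns, one observation per row.

```r
#a small made-up dataframe of tree observations
trees <- data.frame(species = c("oak", "pine", "maple"),
                    height_m = c(30, 41, 20))
#columns are referenced by name using $
trees$height_m
#each row is one observation
trees[1, ]
```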
Today we are going to download some data to our computers. On the Moodle page for our class I have created a data section. Download the folder noaa_weather to your computer. Save it somewhere intuitive, and make a note of the file path; you will need to specify the file path in order to read in the data. I would recommend creating a data folder nested within a folder for this class on your computer. Note that the data come as a compressed .zip file, which you will need to extract. On a PC you should be able to right click and select Extract All, and on a Mac simply double click on the .zip file to extract. In either case the end result will be a folder named noaa_weather that includes documentation on the data, information about the data request that I made, and general instructions for accessing this weather data. On a PC, R does better reading file paths with two backslashes \\ as separators, so we will use that designation in our file path. You will also need quotes around your file path.
#read in weather station file from your data folder
# Here is my PC file path - note my data folder is on Google Drive, so I can access it from multiple computers
datW <- read.csv("G:\\My Drive\\Documents\\teaching\\GEOG331\\F20\\data\\noaa_weather\\2011124.csv",
stringsAsFactors = T)
# Here is my Mac file path
#datW <- read.csv("/Volumes/GoogleDrive/My Drive/Documents/teaching/GEOG331/F20/data/noaa_weather/2011124.csv")
You’ll also find in the folder the full documentation of the data (metadata) and information about the download (for example indicates that I specified metric units). Let’s take a look at the data. You can view the dataframe in Rstudio in the Environment tab by clicking on the dataframe name datW. You can also find out more information about the dataframe as shown below:
#get more information about the dataframe
str(datW)
## 'data.frame': 157849 obs. of 9 variables:
## $ STATION : Factor w/ 5 levels "USC00025700",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ NAME : Factor w/ 5 levels "ABERDEEN, WA US",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ LATITUDE : num 42.8 42.8 42.8 42.8 42.8 ...
## $ LONGITUDE: num -75.7 -75.7 -75.7 -75.7 -75.7 ...
## $ ELEVATION: num 512 512 512 512 512 ...
## $ DATE : Factor w/ 32841 levels "1930-01-01","1930-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ PRCP : num 1.3 0 0 5.1 0 5.3 0 7.4 0 0 ...
## $ TMAX : num 2.8 7.2 3.9 -1.1 -1.7 10 11.7 11.1 11.7 -2.2 ...
## $ TMIN : num -7.8 1.7 -2.2 -12.8 -21.7 -8.3 -3.3 7.8 -5 -12.8 ...
In the output, you'll see the number of rows and columns of the dataframe in the first line and a preview of the first few values in each column. You'll notice there are two types of data: numeric (num) and factors, which R created from the character strings in the file. In the NAME column, you'll see there are 5 unique site names (levels). More on this type of data later.
Before we go any further, we are going to want to change the date format from a factor so it is a little more useful.
#specify a column with a proper date format
#note the format here dataframe$column
datW$dateF <- as.Date(datW$DATE, "%Y-%m-%d")
#google date formatting in r to find more options and learn more
We will also find it helpful to have a numeric year column for working with the data. You can use the format function. This will output a date formatted to include only the year, as a character data type. Since the year is really just a number, we will indicate that it should be treated as numeric using the as.numeric function. Note how R lets you nest functions here.
#create a date column by reformatting the date to only include years
#and indicating that it should be treated as numeric data
datW$year <- as.numeric(format(datW$dateF,"%Y"))
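To see what each step does, here is a sketch with a single made-up date: format() returns the year as a character string, and as.numeric() converts it to a number.

```r
#a made-up example date
d <- as.Date("1995-07-04", "%Y-%m-%d")
#format returns the year as a character string
format(d, "%Y")
#wrapping it in as.numeric converts it to a number we can do math with
as.numeric(format(d, "%Y"))
```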
Let’s run some descriptive statistics on the weather data. If you look at the snapshot of the metadata (full file on the server) below, you will find information about the observations in each column. Note that I downloaded with the metric option.
Let’s start by looking at the data using some basic summary statistics. The mean is a measure of central tendency of our data. You have probably calculated it at some point by adding all of you observations and dividing by the total number of observations. In some fields, this may be referred to as an average for samples, and mean may be used more specifically for a probability distribution (more on that later!). There is a built in function in R to calculate a mean. We will also want to calculate the standard deviation. This measures the spread of the observations around the mean, and keeps the units the same as the mean. We’ll discuss this idea further and cover it in lecture.
Also note that we have data from five sites with very different climates. We will want to describe mean annual patterns separately.
We could start by looking at the mean of a single site using the mean function. We can check all site names using the unique function (or levels, since NAME is a factor). You can subset to a single site by using square brackets and indicating that we want a vector of all TMAX values where the NAME is equal to ABERDEEN, WA US.
#find out all unique site names
unique(datW$NAME)
## [1] MORRISVILLE 6 SW, NY US LIVERMORE, CA US
## [3] ABERDEEN, WA US MORMON FLAT, AZ US
## [5] MANDAN EXPERIMENT STATION, ND US
## 5 Levels: ABERDEEN, WA US ... MORRISVILLE 6 SW, NY US
#look at the mean maximum temperature for Aberdeen
mean(datW$TMAX[datW$NAME == "ABERDEEN, WA US"])
## [1] NA
You get an NA value here. That's because there is missing data in this data set. NA is a specification that lets you know that data is missing and we should not expect a value. NA is handled differently in R and is neither a number nor a character. Luckily there is an argument in mean that allows us to ignore NAs in the calculation.
#look at the mean maximum temperature for Aberdeen
#with na.rm argument set to true to ignore NA
mean(datW$TMAX[datW$NAME == "ABERDEEN, WA US"], na.rm=TRUE)
## [1] 14.68515
Now you will see the average daily maximum temperature in Aberdeen is 14.68 °C. Since this is the average across many decades of observations, this value can help tell us about the climate of the site. Before we move on, average daily temperature is often more helpful for evaluating temperature. Here we will calculate it as the value halfway between the daily minimum and maximum:
#calculate the average daily temperature
#This temperature will be halfway between the minimum and maximum temperature
datW$TAVE <- datW$TMIN + ((datW$TMAX-datW$TMIN)/2)
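The formula above is the midpoint of the daily minimum and maximum, so it is equivalent to (TMIN + TMAX)/2. A quick check with made-up values:

```r
#made-up daily minimum and maximum temperatures
tmin <- c(-7.8, 1.7)
tmax <- c(2.8, 7.2)
#the formula from above
tave1 <- tmin + ((tmax - tmin)/2)
#the simple midpoint
tave2 <- (tmin + tmax)/2
tave1
```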
The above method of calculating means is not very efficient. Instead, we can use the aggregate function to calculate means across an indexing value.
#get the mean across all sites
#the by argument is a list of one or more variables to index over
#FUN indicates the function we want to use
#if you want to specify any function-specific arguments, add them after the function with a comma
#here we want to use the na.rm argument specific to the mean function
averageTemp <- aggregate(datW$TAVE, by=list(datW$NAME), FUN="mean",na.rm=TRUE)
averageTemp
## Group.1 x
## 1 ABERDEEN, WA US 10.432268
## 2 LIVERMORE, CA US 15.381992
## 3 MANDAN EXPERIMENT STATION, ND US 5.568997
## 4 MORMON FLAT, AZ US 22.019010
## 5 MORRISVILLE 6 SW, NY US 6.655229
#change the automatic output of column names to be more meaningful
#note that MAAT is a common abbreviation for Mean Annual Air Temperature
colnames(averageTemp) <- c("NAME","MAAT")
averageTemp
## NAME MAAT
## 1 ABERDEEN, WA US 10.432268
## 2 LIVERMORE, CA US 15.381992
## 3 MANDAN EXPERIMENT STATION, ND US 5.568997
## 4 MORMON FLAT, AZ US 22.019010
## 5 MORRISVILLE 6 SW, NY US 6.655229
The names in this dataset are long. Let's convert the factor data to the underlying level numbers so we can reference sites more quickly. You will have to refer back to the levels to match a number to its site name.
#convert level to number for factor data type
#you will have to reference the level output or look at the row of data to see the character designation.
datW$siteN <- as.numeric(datW$NAME)
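Note that as.numeric on a factor returns the underlying level number (levels are stored alphabetically), not the original text. A sketch with a small made-up factor:

```r
#a made-up factor with two site names
sites <- factor(c("MORMON FLAT, AZ US", "ABERDEEN, WA US", "ABERDEEN, WA US"))
#levels are stored in alphabetical order
levels(sites)
#as.numeric returns each observation's level number, not its text
as.numeric(sites)
```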
Let’s take a look at how often different temperature values are observed at each site. You’ll use a graphical tool called a histogram. A histogram shows the frequency of temperature observations in different bins.
#make a histogram for the first site in our levels
#main= is the title name argument.
#Here you want to paste the actual name of the factor not the numeric index
#since that will be more meaningful.
hist(datW$TAVE[datW$siteN == 1],
     freq=FALSE,
     main = paste(levels(datW$NAME)[1]),
     xlab = "Average daily temperature (degrees C)",
     ylab="Relative frequency",
     col="grey50",
     border="white")
To get a better idea of how the summary statistics describe the data, let’s add lines to the plot to designate where the mean and standard deviation are located. Note we’ll use the standard deviation function. R abbreviates this function name as sd. The abline function allows us to add lines to a plot. The v argument in this function means add a vertical line. I’ll add a red solid line for the mean and red dashed lines for the standard deviation from the mean.
#make a histogram for the first site in our levels, Aberdeen
#main= is the title name argument.
#Here you want to paste the actual name of the factor not the numeric index
#since that will be more meaningful.
hist(datW$TAVE[datW$siteN == 1],
     freq=FALSE,
     main = paste(levels(datW$NAME)[1]),
     xlab = "Average daily temperature (degrees C)",
     ylab="Relative frequency",
     col="grey50",
     border="white")
#add mean line with red (tomato3) color
#and thickness of 3
abline(v = mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
       col = "tomato3",
       lwd = 3)
#add standard deviation line below the mean with red (tomato3) color
#and thickness of 3
abline(v = mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE) - sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
       col = "tomato3",
       lty = 3,
       lwd = 3)
#add standard deviation line above the mean with red (tomato3) color
#and thickness of 3
abline(v = mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE) + sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
       col = "tomato3",
       lty = 3,
       lwd = 3)
You’ll notice that one standard deviation encompasses much of the data. If we went 2 standard deviations out, almost all of the data would be included.
The data distribution that we just viewed has a very particular shape. Temperature observations are most frequent around the mean and we rarely observe data 2 standard deviations from the mean. The distribution is also symmetrical. We can describe this formally with a probability distribution. Probability distributions have a lot of mathematical properties that are useful. We use parameters to help describe the shape of how data is distributed. This temperature data follows a normal distribution. This type of distribution is very common and relies on two parameters: the mean and standard deviation to describe the data distribution. Below is an image of the normal distribution where zero represents the value at the mean of the data and tick marks are designated with standard deviations.
Probability distributions all have functions in R. Below, I show code for using the dnorm function to calculate the probability density for a range of temperature values to add to the plot .
#make a histogram for the first site in our levels
#main= is the title name argument.
#Here you want to paste the actual name of the factor not the numeric index
#since that will be more meaningful.
#note I've named the histogram so I can reference it later
h1 <- hist(datW$TAVE[datW$siteN == 1],
           freq=FALSE,
           main = paste(levels(datW$NAME)[1]),
           xlab = "Average daily temperature (degrees C)",
           ylab="Relative frequency",
           col="grey50",
           border="white")
#the seq function generates a sequence of numbers that we can use to plot the normal across the range of temperature values
x.plot <- seq(-10,30, length.out = 100)
#the dnorm function will produce the probability density based on a mean and standard deviation.
y.plot <- dnorm(seq(-10,30, length.out = 100),
                mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
                sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE))
#scale the normal density to fit in the plot since it has a different range from the data density.
#!!! this is helpful for putting multiple things on the same plot
#!!! It might seem confusing at first. It rescales the curve so its maximum matches the maximum of the histogram; both already share zero as a minimum.
y.scaled <- (max(h1$density)/max(y.plot)) * y.plot
#the points function adds points or lines to a graph
#the first two arguments are the x coordinates and the y coordinates.
points(x.plot,
       y.scaled,
       type = "l",
       col = "royalblue3",
       lwd = 4,
       lty = 2)
You can now see the blue dashed line overlaid on the histogram. This is the normal distribution using the mean and standard deviation calculated from the data. You’ll notice the normal distribution does a good job of modeling our data. Sometimes it underestimates a data range and at other points it overestimates it, but overall it mirrors the distribution of our data. This means we can rely on properties of the normal to help describe our data statistically!
Now let's revisit how we can use probability to think about the occurrence of different air temperature ranges. For a given value of the data, the normal distribution has a probability density associated with observing that value. The probability density at a single value is not itself a probability; we can't really do anything with that particular number on its own. However, when the normal distribution is integrated across a range of values, it yields the probability of observing that range. For those of you who haven't had calculus, integrating is essentially taking the area under the curve between a range of numbers. We have to keep in mind that the range of the normal distribution extends from -\(\infty\) to \(\infty\). Let's start by looking at all values below freezing in the normal distribution for our Aberdeen weather data. Technically this is the probability of all temperatures from zero down to -\(\infty\). Functionally, we know some low temperatures would be impossible to observe on Earth, and the probability of values close to -\(\infty\) is minuscule. Below is a graphical representation of the area of the curve that describes the probability of observing a daily temperature below zero.
Luckily we don't have to do any of the work of calculating the probability. R has a built-in suite of functions for working with probability distributions, all documented together under the normal distribution. Run the documentation on dnorm to see them all:
help(dnorm)
R uses p to designate probability. Let's calculate the probability of below-freezing temperatures. Don't forget that probabilities always range from 0 to 1. We would have to integrate across -\(\infty\) to \(\infty\) to get a probability of 1 in the normal.
#pnorm(value to evaluate at (note this will evaluate for all values and below),mean, standard deviation)
pnorm(0,
      mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
      sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE))
## [1] 0.01682526
You can see below-freezing temperatures are rare at this site; we expect them to occur only about 1.7% of the time. I can take advantage of the properties of the distribution and add and subtract areas under the curve to better tailor my ranges of numbers. For example, I might be interested in identifying how often temperatures between 0 and 5 degrees occur.
#pnorm with 5 gives me all probability (area of the curve) below 5
pnorm(5,
      mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
      sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE))
## [1] 0.1343358
However, if I subtract the area below 0 from that number, I will get the probability of temperatures in the range of 0-5.
#pnorm with 5 gives me all probability (area of the curve) below 5
pnorm(5,
      mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
      sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE)) - pnorm(0,
      mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
      sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE))
## [1] 0.1175105
Now let’s evaluate the probability of high temperatures. Knowing that the entire distribution adds up to 1, we can also find the area above a value. For example, let’s look at the probability of temperatures above 20 degrees C.
#pnorm of 20 gives me all probability (area of the curve) below 20
#subtracting from one leaves me with the area above 20
1 - pnorm(20,
          mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
          sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE))
## [1] 0.02569572
The qnorm function will return the value associated with a probability: the value at or below which observations occur with that probability. Let's use this to evaluate extreme weather events. Let's assume anything that occurs less than 10% of the time (either hot or cold, so anything above the 95th percentile or below the 5th) is unusual. Let's examine where unusually high temperatures in Aberdeen start:
#qnorm with 0.95 gives the temperature below which 95% of days fall
#anything above this value is in the warmest 5% of days
qnorm(0.95,
      mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
      sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE))
## [1] 18.51026
Note: Throughout all of my code examples, you’ll notice that I continued to copy and paste the same code for calculating the mean for site 1: mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE). While I did this to help you remember what was going into the function, it gets confusing and messy in long functions. This is a perfect example of why we name variables (with short clear names!) to refer to later on. As this course progresses, we’ll continue to work on creating clean code once you get more comfortable with R.
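As a sketch of that cleanup, we could store the summary statistics once under short names and reuse them (the values below are made up and simply stand in for the Aberdeen mean and standard deviation computed from the data):

```r
#store the site summary statistics once with short, clear names
#(made-up values standing in for the Aberdeen mean and standard deviation)
aveAber <- 10.4
sdAber <- 6.3
#the distribution calls are now short and readable
pnorm(0, aveAber, sdAber)
qnorm(0.95, aveAber, sdAber)
```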
Daily patterns in precipitation are often less useful than looking at total precipitation throughout the entire year. Use your coding knowledge from above to complete the following questions.
There are 10 questions to this activity. Save your answers in word document that you will hand in on Moodle using a .pdf extension. Keep your script file in your GitHub folder and make sure that all changes are pushed to GitHub. Code for all questions should be clearly commented with the question number. You will include a link to your script file as a part of your final question in this activity. The assignment is worth 30 points.
I will provide a lot of code examples as a part of this Activity. As the course progresses, I will gradually provide less code. You will have to adapt this code in the questions. To do this successfully, you will likely need to refer to R documentation and online resources to thoroughly understand all parts of the code. Troubleshooting and learning to interpret code is a huge part of learning R! It often involves making a lot of mistakes and getting error messages. That’s great and normal! I’ve been working in R for 12 years, and I still generate a lot of mistakes and error messages! You’ll just get better at fixing them.
Common types of objects in R include vectors, matrices, and dataframes. Last class, we briefly learned about creating vectors in R. You assign a name to an object using the arrow: <- and can make a vector using c(). If you remember, R is vectorized so we can automatically apply vector or matrix math to an object. Running the name of an object will print out the object.
#make a vector of tree heights in meters
heights <- c(30,41,20,22)
#convert to cm
heights_cm <- heights*100
heights_cm
## [1] 3000 4100 2000 2200
R uses a very specific notation around vectors. It is helpful for subsetting. Anytime you see square brackets [ ] recognize that this is subsetting an object in R. Also note that R always counts starting at one (no zeros like other programming languages!). Whenever you see a colon :, know that you are indexing through multiple values.
#look at the first tree height
heights[1]
## [1] 30
#look at the 2nd and 3rd tree heights
heights[2:3]
## [1] 41 20
I like to think of matrices as essentially multiple columns of vectors in R. You can set up a matrix using the matrix() function. Functions are built in to R and will always have parenthesis () following the function name. Anything inside of the parentheses is called an argument. Here to build a matrix, you need a vector of numbers that fills in the matrix and specify the number of columns or rows. You can also specify how to fill in the matrix (do you go through all columns in each row?). If you ever aren’t sure of arguments for a function, you can use help or ?.
#get more info on the matrix function
help(matrix)
## starting httpd help server ... done
Now, make some example matrices. Note the differences in filling in by column vs row.
#set up a matrix with 2 columns and fill in by rows
#first argument is the vector of numbers to fill in the matrix
Mat<-matrix(c(1,2,3,4,5,6), ncol=2, byrow=TRUE)
Mat
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
## [3,] 5 6
#set up a matrix that fills in by columns
#first argument is the vector of numbers to fill in the matrix
Mat.bycol<-matrix(c(1,2,3,4,5,6), ncol=2, byrow=FALSE)
Mat.bycol
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
Notice when the both the matrices are run, the output now has square bracket labels with two spots. This notation will always be formatted [row,column]. You can use this notation to refer to specific spots in a matrix using the name of the matrix.
#subset the matrix to look at row 1, column2
Mat.bycol[1,2]
## [1] 4
An opening in a square bracket after or before a comma means refer to all items in that slot. You’ll notice the output is a vector when subsetting.
#look at all values in row 1
Mat.bycol[1,]
## [1] 1 4
#look at all values in column 2
Mat.bycol[,2]
## [1] 4 5 6
Today, we will look at weather patterns and extreme events from five sites across the US. I downloaded this data from NOAA, and I’ve included instructions for downloading this type of data for future reference.
Most data that you will work with is too large to be typed into R scripts. A lot of the data will be available in spreadsheet or text file format. You will want to work with this type of data as a dataframe. Dataframes are are matrices where columns have names and rows contain observations for the same entity. This works a lot like a spreadsheet, but we can apply our vector and matrix utilities in R. For example, we will read in a data file for weather data where columns contain data for temperature and precipitation, and each row has an observation for the same day and weather station. You may be used to working with data in excel spreadsheets with a .xlsx extension that is proprietary to Microsoft. This is not ideal to work with, and we will work with .csv (comma separated values file). This file contains the text version of an excel spreadsheet where the values for each cell are included in the file and the designation of a new cell starts with a comma. You can always resave an .xlsx file as a .csv. Let’s read in a file.
Today we are going to download some data to our computers - on the Moodle page for our class if have created a data section. Download the folder noaa_weather to your computer. Save it somewhere intuitive, and make a note of the file path. You will need to specify the file path in order to read in the data.I would recommend creating a data folder nested within a folder for this class on your computer. Note that the data come as a compressed .zip file which you will need to extract. On a PC you should be able to right click and select Extract All, and on a Mac simply double click on the .zip file to extract. In either case the end results will be a folder named noaa_weather that includes documentation on the data, information about the data request that I made, and general instructions for accessing this weather data. Often R does better reading in file paths with two *\* in a PC environment so we will use that designation in our file path. You will also need quotes around your file path.
#read in weather station file from your data folder
# Here is my PC file path - note my data folder is on Google Drive, so I can access it from multiple computers
datW <- read.csv("G:\\My Drive\\Documents\\teaching\\GEOG331\\F20\\data\\noaa_weather\\2011124.csv",
stringsAsFactors = T)
# Here is my Mac file path
#datW <- read.csv("/Volumes/GoogleDrive/My Drive/Documents/teaching/GEOG331/F20/data/noaa_weather/2011124.csv")
You’ll also find in the folder the full documentation of the data (metadata) and information about the download (for example indicates that I specified metric units). Let’s take a look at the data. You can view the dataframe in Rstudio in the Environment tab by clicking on the dataframe name datW. You can also find out more information about the dataframe as shown below:
#get more information about the dataframe
str(datW)
## 'data.frame': 157849 obs. of 9 variables:
## $ STATION : Factor w/ 5 levels "USC00025700",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ NAME : Factor w/ 5 levels "ABERDEEN, WA US",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ LATITUDE : num 42.8 42.8 42.8 42.8 42.8 ...
## $ LONGITUDE: num -75.7 -75.7 -75.7 -75.7 -75.7 ...
## $ ELEVATION: num 512 512 512 512 512 ...
## $ DATE : Factor w/ 32841 levels "1930-01-01","1930-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ PRCP : num 1.3 0 0 5.1 0 5.3 0 7.4 0 0 ...
## $ TMAX : num 2.8 7.2 3.9 -1.1 -1.7 10 11.7 11.1 11.7 -2.2 ...
## $ TMIN : num -7.8 1.7 -2.2 -12.8 -21.7 -8.3 -3.3 7.8 -5 -12.8 ...
In the output, you’ll see the number of rows and columns of the data frame in the first line and a preview of the first few rows of data in each preview. You’ll notice there are two types of data: numeric and char or character strings. In the NAME column, you’ll see there are 5 unique site names. More on this type of data later.
Before we go any further, we are going to want to change the date format from a factor so it is a little more useful.
#specify a column with a proper date format
#note the format here dataframe$column
datW$dateF <- as.Date(datW$DATE, "%Y-%m-%d")
#google date formatting in r to find more options and learn more
We also will find it helpful to have a year column that is numeric for working with the data. You can use the format function. This will output a date formatted to just include year in a character data type. Since this data is simply a numeric date, we will indicate that it should actually be numeric using the as.numeric function. Note how R lets you nest functions here.
#create a date column by reformatting the date to only include years
#and indicating that it should be treated as numeric data
datW$year <- as.numeric(format(datW$dateF,"%Y"))
Let’s run some descriptive statistics on the weather data. If you look at the snapshot of the metadata (full file on the server) below, you will find information about the observations in each column. Note that I downloaded with the metric option.
Let’s start by looking at the data using some basic summary statistics. The mean is a measure of central tendency of our data. You have probably calculated it at some point by adding all of you observations and dividing by the total number of observations. In some fields, this may be referred to as an average for samples, and mean may be used more specifically for a probability distribution (more on that later!). There is a built in function in R to calculate a mean. We will also want to calculate the standard deviation. This measures the spread of the observations around the mean, and keeps the units the same as the mean. We’ll discuss this idea further and cover it in lecture.
Also note that we have data from five sites with very different climates. We will want to describe mean annual patterns separately.
We could start by looking at the mean of a single site using the mean function. We can check all site names using the unique function. You can subset to a single site by using square brackets and indicating that we want a vector of all TMAX values where the NAME is equal to ABERDEEN, WA US.
#find out all unique site names
unique(datW$NAME)
## [1] MORRISVILLE 6 SW, NY US LIVERMORE, CA US
## [3] ABERDEEN, WA US MORMON FLAT, AZ US
## [5] MANDAN EXPERIMENT STATION, ND US
## 5 Levels: ABERDEEN, WA US ... MORRISVILLE 6 SW, NY US
#look at the mean maximum temperature for Aberdeen
mean(datW$TMAX[datW$NAME == "ABERDEEN, WA US"])
## [1] NA
You get an NA value here. That’s because there is missing data in this data set. NA is a specification that lets you know the data is missing and we should not expect a value. NA is handled differently in R and is neither a number nor a character. Luckily there is an argument in mean that allows us to ignore NAs in the calculation.
#look at the mean maximum temperature for Aberdeen
#with na.rm argument set to true to ignore NA
mean(datW$TMAX[datW$NAME == "ABERDEEN, WA US"], na.rm=TRUE)
## [1] 14.68515
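To see why the NA appeared in the first place, here is a toy example of how NA propagates through calculations (the vector is made up, not the weather data):

```r
#NA propagates: any arithmetic involving NA returns NA
x <- c(30, NA, 20)
x + 1                  #returns 31 NA 21
mean(x)                #returns NA
is.na(x)               #use is.na to test for missing values
mean(x, na.rm = TRUE)  #returns 25, the mean of the non-missing values
```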
Now you will see the average daily maximum temperature in Aberdeen is 14.68 °C. Since this is the average across many decades of observations, this value can help tell us about the climate of the site. Before we move on, average daily temperature is often more helpful than the maximum alone for evaluating temperature. Here we estimate it as halfway between the daily minimum and maximum. We can calculate it as follows:
#calculate the average daily temperature
#This temperature will be halfway between the minimum and maximum temperature
datW$TAVE <- datW$TMIN + ((datW$TMAX-datW$TMIN)/2)
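Note that the halfway formula above is algebraically the same as averaging the minimum and maximum; a quick check with made-up values:

```r
#TMIN + (TMAX - TMIN)/2 is the same as (TMIN + TMAX)/2
tmin <- 10
tmax <- 20
tmin + ((tmax - tmin) / 2)  #returns 15
(tmin + tmax) / 2           #returns 15
```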
The above method of calculating means one site at a time is not very efficient. Instead, we can use the aggregate function to calculate means across an indexing variable.
#get the mean across all sites
#the by argument is a list of one or more variables to index over.
#FUN indicates the function we want to use
#if you want to specify any function specific arguments use a comma and add them after the function
#here we want to use the na.rm argument specific to the mean function
averageTemp <- aggregate(datW$TAVE, by=list(datW$NAME), FUN="mean",na.rm=TRUE)
averageTemp
## Group.1 x
## 1 ABERDEEN, WA US 10.432268
## 2 LIVERMORE, CA US 15.381992
## 3 MANDAN EXPERIMENT STATION, ND US 5.568997
## 4 MORMON FLAT, AZ US 22.019010
## 5 MORRISVILLE 6 SW, NY US 6.655229
#change the automatic output of column names to be more meaningful
#note that MAAT is a common abbreviation for Mean Annual Air Temperature
colnames(averageTemp) <- c("NAME","MAAT")
averageTemp
## NAME MAAT
## 1 ABERDEEN, WA US 10.432268
## 2 LIVERMORE, CA US 15.381992
## 3 MANDAN EXPERIMENT STATION, ND US 5.568997
## 4 MORMON FLAT, AZ US 22.019010
## 5 MORRISVILLE 6 SW, NY US 6.655229
The names in this dataset are long. Let’s convert the factor data to the underlying level numbers so we can reference sites more quickly. You will have to refer back to the levels output to match each number with its name.
#convert level to number for factor data type
#you will have to reference the level output or look at the row of data to see the character designation.
datW$siteN <- as.numeric(datW$NAME)
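As a toy example of what this conversion does (made-up site names, not the weather data): as.numeric on a factor returns each value’s level number, with levels stored in alphabetical order. Note this only works on factors; as.numeric on plain character strings returns NA.

```r
#toy example: converting a factor to its underlying level numbers
sites <- factor(c("ABERDEEN", "LIVERMORE", "ABERDEEN", "MANDAN"))
levels(sites)      #levels are stored in alphabetical order
as.numeric(sites)  #returns 1 2 1 3, the level number of each value
```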
Let’s take a look at how often different temperature values are observed at each site. You’ll use a graphical tool called a histogram. A histogram shows the frequency of temperature observations in different bins.
#make a histogram for the first site in our levels
#main= is the title name argument.
#Here you want to paste the actual name of the factor not the numeric index
#since that will be more meaningful.
hist(datW$TAVE[datW$siteN == 1],
freq=FALSE,
main = paste(levels(datW$NAME)[1]),
xlab = "Average daily temperature (degrees C)",
ylab="Relative frequency",
col="grey50",
border="white")
To get a better idea of how the summary statistics describe the data, let’s add lines to the plot to designate where the mean and standard deviation are located. Note we’ll use the standard deviation function. R abbreviates this function name as sd. The abline function allows us to add lines to a plot. The v argument in this function means add a vertical line. I’ll add a red solid line for the mean and red dashed lines for the standard deviation from the mean.
#make a histogram for the first site in our levels, Aberdeen
#main= is the title name argument.
#Here you want to paste the actual name of the factor not the numeric index
#since that will be more meaningful.
hist(datW$TAVE[datW$siteN == 1],
freq=FALSE,
main = paste(levels(datW$NAME)[1]),
xlab = "Average daily temperature (degrees C)",
ylab="Relative frequency",
col="grey50",
border="white")
#add mean line with red (tomato3) color
#and thickness of 3
abline(v = mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
col = "tomato3",
lwd = 3)
#add standard deviation line below the mean with red (tomato3) color
#and thickness of 3
abline(v = mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE) - sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
col = "tomato3",
lty = 3,
lwd = 3)
#add standard deviation line above the mean with red (tomato3) color
#and thickness of 3
abline(v = mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE) + sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
col = "tomato3",
lty = 3,
lwd = 3)
You’ll notice that one standard deviation encompasses much of the data. If we went 2 standard deviations out, almost all of the data would be included.
The data distribution that we just viewed has a very particular shape. Temperature observations are most frequent around the mean, and we rarely observe data more than 2 standard deviations from the mean. The distribution is also symmetrical. We can describe this formally with a probability distribution. Probability distributions have many useful mathematical properties, and we use parameters to describe the shape of the data distribution. This temperature data follows a normal distribution. This type of distribution is very common and relies on two parameters, the mean and the standard deviation, to describe the data distribution. Below is an image of the normal distribution where zero represents the mean of the data and tick marks are spaced by standard deviations.
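You can check these proportions for any normal distribution using the pnorm function on the standard normal, where the defaults are a mean of 0 and a standard deviation of 1:

```r
#proportion of a normal distribution within 1 and 2 standard
#deviations of the mean, using the standard normal (mean 0, sd 1,
#the defaults for pnorm)
pnorm(1) - pnorm(-1)  #about 0.68, so ~68% of data within 1 sd
pnorm(2) - pnorm(-2)  #about 0.95, so ~95% of data within 2 sd
```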
Probability distributions all have functions in R. Below, I show code using the dnorm function to calculate the probability density for a range of temperature values and add it to the plot.
#make a histogram for the first site in our levels
#main= is the title name argument.
#Here you want to paste the actual name of the factor not the numeric index
#since that will be more meaningful.
#note I've named the histogram so I can reference it later
h1 <- hist(datW$TAVE[datW$siteN == 1],
freq=FALSE,
main = paste(levels(datW$NAME)[1]),
xlab = "Average daily temperature (degrees C)",
ylab="Relative frequency",
col="grey50",
border="white")
#the seq function generates a sequence of numbers that we can use to plot the normal across the range of temperature values
x.plot <- seq(-10,30, length.out = 100)
#the dnorm function will produce the probability density based on a mean and standard deviation.
y.plot <- dnorm(seq(-10,30, length.out = 100),
mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE))
#scale the normal density curve so it fits on the plot, since the normal
#density can cover a different range of y values than the histogram density.
#!!! this is helpful for putting multiple things on the same plot
#!!! It might seem confusing at first. It makes the maximum value the same
#for both datasets on the plot. Here both share zero as a minimum.
y.scaled <- (max(h1$density)/max(y.plot)) * y.plot
#points function adds points or lines to a graph
#the first two arguments are the x coordinates and the y coordinates.
points(x.plot,
y.scaled,
type = "l",
col = "royalblue3",
lwd = 4,
lty = 2)
You can now see the blue dashed line overlaid on the histogram. This is the normal distribution using the mean and standard deviation calculated from the data. You’ll notice the normal distribution does a good job of modeling our data. Sometimes it underestimates a data range and at other points it overestimates it, but overall it mirrors the distribution of our data. This means we can rely on properties of the normal to help describe our data statistically!
Now let’s revisit how we can use probability to think about the occurrence of different air temperature ranges. For a given value of the data, the normal distribution has a probability density associated with observing that value. On its own, the probability density at a single value is not a probability, and we can’t really do anything with that particular number. However, when the normal distribution is integrated across a range of values, it yields the probability of observing that range of values. For those of you who haven’t had calculus, integrating is essentially taking the area under the curve between a range of numbers. We have to keep in mind that the range of the normal distribution extends from -\(\infty\) to \(\infty\). Let’s start by taking a look at all values below freezing in the normal distribution for our Aberdeen weather data. Technically this is the probability of all temperatures from -\(\infty\) up to zero. Functionally, we know some low temperatures would be impossible to observe on Earth, and the probability of observing values closer to -\(\infty\) will be minuscule. Below is a graphical representation of the area of the curve that describes the probability of observing a daily temperature below zero.
Luckily we don’t have to do any of the work of calculating the probability. R has a built-in suite of functions for working with probability distributions. Below is an image of the documentation for all functions related to the normal distribution. Run the documentation on dnorm to see them all:
help(dnorm)
R uses p to designate probability. Let’s calculate the probability of below-freezing temperatures. Don’t forget that probabilities always range from 0 to 1. We would have to integrate across -\(\infty\) to \(\infty\) to get a probability of 1 in the normal.
#pnorm(value to evaluate at, mean, standard deviation)
#note this gives the probability of all values at and below the given value
pnorm(0,
mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE))
## [1] 0.01682526
You can see below-freezing temperatures are rare at this site; we expect them to occur less than 2% of the time. I can take advantage of the properties of the distribution and add and subtract areas under the curve to tailor my ranges of numbers. For example, I might be interested in identifying how often temperatures between 0-5 degrees C occur.
#pnorm with 5 gives me all probability (area of the curve) below 5
pnorm(5,
mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE))
## [1] 0.1343358
However, if I subtract the area below 0 from that number, I will get the probability of temperatures in the range of 0-5.
#subtracting the probability below 0 from the probability below 5
#leaves the probability of temperatures between 0 and 5
pnorm(5,
mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE))- pnorm(0,
mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE))
## [1] 0.1175105
Now let’s evaluate the probability of high temperatures. Knowing that the entire distribution adds up to 1, we can also find the area above a value. For example, let’s look at the probability of temperatures above 20 degrees C.
#pnorm of 20 gives me all probability (area of the curve) below 20
#subtracting from one leaves me with the area above 20
1 - pnorm(20,
mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE))
## [1] 0.02569572
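An equivalent way to get the upper tail is pnorm’s lower.tail argument: with lower.tail = FALSE, pnorm returns the area above the value directly (the mean and sd here are made-up values for illustration, not the Aberdeen statistics):

```r
#with lower.tail = FALSE there is no need to subtract from 1
#(made-up mean and sd shown for illustration)
1 - pnorm(20, 10.4, 5.7)                  #area above 20
pnorm(20, 10.4, 5.7, lower.tail = FALSE)  #same area, one call
```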
The qnorm function does the reverse of pnorm: it returns the value associated with a probability, meaning all observations at or below the returned value occur with that probability. Let’s use this to evaluate extreme weather events. Let’s assume everything that occurs less than 10% of the time (either hot or cold, so anything above the 95th percentile or below the 5th percentile) is unusual. Let’s examine where unusually high temperatures in Aberdeen start:
#qnorm with 0.95 gives the temperature that 95% of all observations
#fall at or below; higher temperatures occur only 5% of the time
qnorm(0.95,
mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE),
sd(datW$TAVE[datW$siteN == 1],na.rm=TRUE))
## [1] 18.51026
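Since qnorm is the inverse of pnorm, you can sanity-check a result by feeding it back into pnorm; shown here on the standard normal:

```r
#qnorm and pnorm are inverses of each other
q <- qnorm(0.95)  #about 1.645 on the standard normal
pnorm(q)          #returns 0.95, the original probability
```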
Note: Throughout all of my code examples, you’ll notice that I continued to copy and paste the same code for calculating the mean for site 1: mean(datW$TAVE[datW$siteN == 1],na.rm=TRUE). While I did this to help you remember what was going into the function, it gets confusing and messy in long functions. This is a perfect example of why we name variables (with short clear names!) to refer to later on. As this course progresses, we’ll continue to work on creating clean code once you get more comfortable with R.
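For example, the repeated subset could be stored once under a short, clear name (aberdeenTAVE is a name I made up; datW and siteN come from the code above):

```r
#store the repeated subset once and reuse it
aberdeenTAVE <- datW$TAVE[datW$siteN == 1]
mean(aberdeenTAVE, na.rm = TRUE)
sd(aberdeenTAVE, na.rm = TRUE)
qnorm(0.95,
      mean(aberdeenTAVE, na.rm = TRUE),
      sd(aberdeenTAVE, na.rm = TRUE))
```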
Daily patterns in precipitation are often less useful than looking at total precipitation throughout the entire year. Use your coding knowledge from above to complete the following questions.